Reliable Evaluations of URL Normalization
Abstract
URL normalization is the process of transforming URL strings into a canonical form. Through this process, the number of duplicate URL representations for web pages can be reduced significantly. A number of normalization methods exist. In this paper, we describe four metrics for evaluating normalization methods. The reliability and consistency of a URL are also considered in our evaluation. Using the proposed metrics, we evaluate seven normalization methods. Evaluation results on over 25 million URLs extracted from the web are reported.
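As background for the kind of transformation the abstract refers to, the sketch below applies a few standard RFC 3986 normalization steps (lowercasing the scheme and host, dropping the default port, and resolving dot-segments). This is an illustrative example of URL normalization in general, not a reproduction of the seven methods evaluated in the paper; the function name and the chosen steps are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit
import posixpath

def normalize_url(url: str) -> str:
    """Apply a few standard RFC 3986 normalizations to a URL (illustrative only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                      # lowercase the scheme
    host = parts.hostname.lower() if parts.hostname else ""

    # Drop the default port for the scheme (e.g. :80 for http, :443 for https).
    default_ports = {"http": 80, "https": 443}
    port = parts.port
    if port is None or default_ports.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # Resolve "." and ".." segments; an empty path becomes "/".
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and path != "/":
        path += "/"                                    # preserve a trailing slash

    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

print(normalize_url("HTTP://Example.COM:80/a/b/../c/"))
```

For example, `HTTP://Example.COM:80/a/b/../c/` and `http://example.com/a/c/` normalize to the same string, so a crawler comparing canonical forms would treat them as one page.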